Development and external validation of machine learning algorithms for postnatal gestational age estimation using clinical data and metabolomic markers

Background Accurate estimates of gestational age (GA) at birth are important for preterm birth surveillance but can be challenging to obtain in low income countries. Our objective was to develop machine learning models to accurately estimate GA shortly after birth using clinical and metabolomic data. Methods We derived three GA estimation models using ELASTIC NET multivariable linear regression using metabolomic markers from heel-prick blood samples and clinical data from a retrospective cohort of newborns from Ontario, Canada. We conducted internal model validation in an independent cohort of Ontario newborns, and external validation in heel prick and cord blood sample data collected from newborns from prospective birth cohorts in Lusaka, Zambia and Matlab, Bangladesh. Model performance was measured by comparing model-derived estimates of GA to reference estimates from early pregnancy ultrasound. Results Samples were collected from 311 newborns from Zambia and 1176 from Bangladesh. The best-performing model accurately estimated GA within about 6 days of ultrasound estimates in both cohorts when applied to heel prick data (MAE 0.79 weeks (95% CI 0.69, 0.90) for Zambia; 0.81 weeks (0.75, 0.86) for Bangladesh), and within about 7 days when applied to cord blood data (1.02 weeks (0.90, 1.15) for Zambia; 0.95 weeks (0.90, 0.99) for Bangladesh). Conclusions Algorithms developed in Canada provided accurate estimates of GA when applied to external cohorts from Zambia and Bangladesh. Model performance was superior in heel prick data as compared to cord blood data.

Introduction Preterm birth is a leading cause of neonatal morbidity and mortality worldwide that disproportionately affects infants born in low-and middle-income countries (LMIC) [1]. Knowledge of the burden of preterm birth across LMIC is essential to evaluating the impact of policies and programs aimed at improving pregnancy and neonatal outcomes. Accurate estimates of preterm birth rates rely on the use of standard definitions across jurisdictions and consistent reporting of pregnancy outcomes. International preterm birth surveillance efforts are hampered in many LMIC by limited access to ultrasound dating technology and poor recall reliability of a woman's last menstrual period [2][3][4]. The reliability of commonly used newborn assessments for estimating gestational age (GA) postnatally is often limited in preterm and growth-restricted infants, and these methods are also subject to high inter-user variability, making estimation of the burden of preterm birth in many jurisdictions problematic [5][6][7][8][9].
There is now a push by health organizations for strengthened data surveillance systems and novel tools that can more accurately assess and monitor rates of preterm birth in low resource countries, many having some of the highest preterm birth rates globally [10,11]. Our research team has previously investigated the distribution of analyte levels obtained from newborn screening data in Ontario infants according to GA at birth and found that many analytes included in the newborn screening panel varied with level of prematurity, including a number of amino acids, acylcarnitines and fatty acid oxidation markers, 17-OHP, thyroid stimulating hormone and the relative proportions of fetal and adult hemoglobin subtypes [12,13]. Based on these findings, we developed algorithms in a North American cohort of newborns to estimate GA postnatally using newborn screening analyte levels from newborn dried blood spots and a small number of clinical covariates. We have demonstrated that these previously developed models provided accurate estimates to within about 1-2 weeks of ultrasound-based GA [14][15][16][17]. Previous models were developed using conventional multivariable linear and logistic regression methods and have been internally validated in specific ethnic subgroups of the Canadian population [18], and externally validated in a cohort of infants born in Matlab, Bangladesh [19], demonstrating satisfactory performance, but lower accuracy of GA predictions compared to the setting in which models were developed. In recent years, a number of powerful machine learning (ML) methods have been developed that allow the efficient handling of large databases, large numbers of predictors, and the incorporation of non-linear associations and complex interactions. We sought to redevelop our models by incorporating advanced ML methods with the aim of dramatically improving model performance. In this study, we report the performance of our newly developed ML-based GA estimation algorithms using ELASTIC NET regression [20] in two international prospective birth cohorts from Lusaka, Zambia, and Matlab Bangladesh.

Study design and research participants
We sought to evaluate the performance of a GA estimation algorithm developed in a retrospective population cohort of infants from Ontario, Canada in prospective birth cohorts of infants born in Lusaka, Zambia and Matlab, Bangladesh. Detailed descriptions of all three cohorts have been published previously [17,21,22]. A summary of the cohorts is available in Table 1.

Sample collection and analysis
Details of sample collection and analysis are included in the Supplemental Methods in S1 File. All models were developed and internally validated based on clinical covariates and laboratory results from heel prick blood samples in the Ontario cohort. For the Zambia and Bangladesh cohorts, a subset of infants had only a heel prick or a cord blood sample collected, and a further subset had both sample types. All samples from the Ontario, Zambia and Bangladesh cohorts were analyzed centrally at the Newborn Screening Ontario (NSO) laboratory in Ottawa, Canada. Analytes were included as candidate predictors in GA estimation models based on their routine measurement as part of Ontario's expanded newborn screening program, including hemoglobin profiles, amino acids, acylcarnitines, hormone and endocrine markers, enzymes and co-enzymes ( Table 2). Management of incidental clinical findings (screen-positive cases) for conditions screened for by the NSO program identified in the course of analyzing international samples have been reported elsewhere [23,24].

Statistical analysis
All analyses were conducted using SAS 9.4 and R 3.3.2. Additional details of statistical methods are provided in the Supplemental Methods in S1 File. Data preparation steps, including standardization and log transformation were applied uniformly in the Ontario, Zambia and Bangladesh cohorts. A complete case analysis was used for model derivation and internal

Matlab, Bangladesh
International Centre for Diarrhoeal Disease Research, Bangladesh validation in Ontario due to very low missingness and the large available sample size. A small proportion of subjects in the Zambia and Bangladesh cohorts had missing values for a small subset of analytes (due to sample quality degradation or insufficient blood spot volume) and these values were multiply imputed (additional details in Supplemental Methods in S1 File). Three distinct GA estimation models were derived. Models were fit and all parameters estimated using both the Ontario training subset (N = 79,636) and validation subset (N = 39,829) of the Ontario cohort: Model 1: Baseline model containing only infant sex, multiple birth (yes/no), birth weight (grams), and pairwise interactions among these covariates.
Model 2: Analytes model including infant sex, multiple birth (yes/no), newborn screening analytes (listed in Table 2), and pairwise interactions among covariates.
Model 3: Full model containing infant sex, multiple birth (yes/no), birth weight (grams), newborn screening analytes, and pairwise interactions among these covariates.
The analytes most strongly associated with GA were identified using partial spearman correlation analysis, and these, in addition to birthweight, were modeled using restricted cubic splines with 5 knots to allow for non-linearity. All other analytes were modeled with linear terms. To address the large imbalance in number of term infants compared to preterm and post-term infants, we applied a weighting scheme in the regression model fitting process to balance the influence of term infants with less common gestational ages (additional details in Supplemental Methods in S1 File).
Given the large number of covariates and interactions included, Model 2 and Model 3 were fit using an ELASTIC NET ML approach [20].
Final Ontario model equations were used to calculate an estimated GA in the test subset of the Ontario cohort that had no role in model development, as well as in the Zambia and Bangladesh external validation cohorts. For each infant, model performance was assessed by comparing the estimated GA from the model to the ultrasound-derived GA and calculating the mean absolute error (MAE) measured in weeks (the average of the absolute difference between model-based vs. ultrasound-based values across all observations). Lower MAE reflects more accurate model-based GA estimates. We also calculated the percentage of infants with GAs correctly estimated within 7 days of ultrasound-based GA. We assessed model performance overall and in important subgroups: preterm birth (<37 weeks gestation), and small-for-gestational age: below the 10 th (SGA10) and 3 rd (SGA3) percentile for birth weight within categories of gestational week at delivery and infant sex, based on INTERGROWTH-21 categories [25]. 95% bootstrap percentile confidence intervals were calculated for all validation performance metrics, which also took account of data imputation in the Bangladesh and Zambia cohorts.

Estimated preterm birth rate
We estimated the preterm birth rate by calculating the proportion of model-based GA estimates that were below 37 weeks, as well as 95% bootstrap confidence intervals, and compared these to the observed preterm birth rate in each cohort (based on ultrasound GA).

Classification accuracy
Although our models were intended to estimate gestational age in weeks on a continuous scale and not intended or optimized for classification, we calculated the accuracy, sensitivity and specificity of continuous Model 3 estimates when used to classify infants as term or preterm in heel prick data for Ontario, Bangladesh and Zambia. Depending on the intended application, a logistic regression or other methodology better suited to prediction of a dichotomous outcome would be preferred, and cut-points could be adjusted to prioritize sensitivity or specificity in terms of classifying infants.

Participant characteristics
The Ontario internal validation (testing subset) cohort included 39,666 infants (Fig 1). In the Ontario cohort, the proportion of infants born preterm (GA<37 weeks) based on ultrasound was 5.6%, and 3.9% and 0.92% of infants were classified as SGA10 and SGA3, respectively. In the external validation cohorts, a total of 142 heel prick samples and 265 cord blood samples collected from 311 unique newborns from Zambia, and 520 heel prick samples and 1139 cord blood samples collected from 1176 unique newborns from Bangladesh. 32 (22.5%) heel and 99 (37.4%) cord samples from Zambia were missing one or more analytes, with a maximum of three missing in any sample. 28 (5.4%) heel and 22 (1.9%) cord samples from Bangladesh were missing one or more analytes, with a maximum of 5 and 3 missing values respectively (one subject missing all analyte values was removed from the analysis). Based on ultrasound GA, the preterm birth proportions were similar in the Zambia and Bangladesh cohorts (9.6% versus 9.7%). The Zambia cohort had a much lower proportion of newborns classified as SGA10 as

PLOS ONE
compared to the Bangladesh cohort (9.7% vs. 25.3%). Characteristics of unique infants in each cohort, as well as the infants represented in heel prick and cord blood sample cohorts are provided in Table 3.

Model performance using heel prick data from Ontario, Zambia and Bangladesh
Model external validation performance was largely similar between the Zambia and Bangladesh cohorts. Accuracy of estimated GA was generally lower in the external validation cohorts than in the Ontario internal validation cohort for the same models. In general, Model 3, which included both clinical and analyte predictors, outperformed Models 1 and 2. Accuracy of GA estimates in all models was highest in term infants and tended to be lower in preterm and SGA infants, but Model 3 consistently provided the most accurate estimates across the spectrum of GA (Table 4 and    When applied to samples from preterm infants in Ontario, Model 3 correctly estimated GA to within 7 days of ultrasound-assigned values (MAE (95%CI) 1.03 (0.99, 1.06)), and performed significantly better than Models 1 and 2 (MAE (95%CI) 1.76 (1.71, 1.81) and 1.25 (1.21, 1.29), respectively). The number of preterm infants in the external validation cohorts was small, with only 11 heel prick samples from Zambia and 35 from Bangladesh, however in both settings, model 3 outperformed both models 1 and 2, and was accurate to within about 10 days on average in preterm infants (MAE (95% CI) 1.49 (1.07, 1.93) and 1.44 (1.16, 1.70), respectively in Zambia and Bangladesh).
When applied to Ontario heel prick samples from SGA infants, Model 2 had the best performance, estimating GA to within about 6 days of ultrasound-assigned values for SGA10 and 7 days for SGA3 subgroups (MAE (95% CI) 0.90 (0.86, 0.94) and 1.03 (0.92, 1.13), respectively). Our validation cohort from Zambia contained only 11 SGA10 and 3 SGA3 classified infants. When applied to these samples, Models 2 and 3 both outperformed Model 1, and estimated GA to within about 6 days of ultrasound-assigned values in SGA10 infants, and within about 9 days in SGA3 infants. Our heel prick sample cohort from Bangladesh contained a larger number of SGA samples (147 SGA10 and 61 SGA3 samples). When applied to these samples from Bangladesh, Model 3 estimated GA to within about 6 days of ultrasoundassigned values in the SGA10 subgroup and within about 7 days in the SGA3 subgroup (MAE (95% CI) 0.81 (0.72, 0.91) and 0.97 (0.81, 1.15), respectively). Scatter plots of observed GA versus estimated GA for all three models in the Ontario, Zambia and Bangladesh heel prick cohorts are presented in Fig 3, which shows that in general, lower (preterm) GAs tend to be overestimated by all three models when applied to both external cohorts, with the overestimation being much more pronounced for Model 1 estimates. Similarly, GA tended to be slightly overestimated in post-term infants. These patterns were observed in the Ontario test cohort only for Model 1.

Model performance using cord blood data from Zambia and Bangladesh
Overall, performance was attenuated when applied to cord blood samples vs. heel prick samples for Models 2 and 3 (the models including analyte covariates), in both the Zambia and Bangladesh cohorts (Table 5 and Fig 2). In the cord blood cohort from Zambia, Models 1 and 3 performed similarly (MAE (95% CI) 1.03 (0.93,1.13) vs. 1.02 (0.90, 1.15)), respectively), and significantly When applied to cord blood samples from preterm infants in Zambia (n = 22), all models generally performed poorly. The best performing model in this group was Model 1, with an MAE (95% CI) of 2.42 (1.73, 3.12) and estimated GA within 1 week in 36.3% of infants. Models performed slightly better in estimating preterm GA in cord blood samples from Bangladesh   (11 SGA10 and 3 SGA3). Model 3 had the best performance in this subgroup, performing similarly in SGA10 and SGA3 infants and correctly estimating GA to within 1 week in 54.6 (95% CI 33.3, 73.9) and 62.9% (95% CI25.0, 100.0) of cord samples, respectively. When applied to cord samples from SGA infants in Bangladesh, Model 3 also had the best performance, and correctly estimated GA to within 1 week in 60.8% (95% CI 54.9, 66.4) and 51.4% (95% CI 41.8, 60.8) of cord samples from SGA10 and SGA3 infants, respectively. Fig 4 presents scatter plots for observed versus model-estimated GA for all three models in the Zambia and Bangladesh cord blood cohorts.

Classification accuracy
In the Ontario heel prick testing cohort, 1485/2226 preterm infants were correctly classified (sensitivity = 66.7%) and 36680/37440 term infants were correctly classified (specificity = 98.0%) by Model 3. In the Bangladesh heel prick cohort, 14/35 preterm infants were correctly classified as preterm (sensitivity of 40%) and 479/485 term infants were correctly classified as term (specificity of 96.8%) by Model 3. In the Zambia heel prick cohort, 6/11 preterm infants were correctly classified (sensitivity = 54.5%) and 130/131 term infants were correctly classified (specificity = 99.2%) by Model 3. Classification accuracy was markedly lower for Models 1 and 2 in all cohorts, and all 3 models were less accurate in the cord blood cohorts from Bangladesh and Zambia compared to heel prick results.

Discussion
In this study, we demonstrated that our ML algorithms for postnatal GA estimation developed from heel prick blood sample data in Ontario, Canada, can be successfully applied in low and middle income countries in Sub-Saharan Africa and South Asia, having some of the highest preterm birth rates globally [10,11]. When applied to heel prick samples from Lusaka, Zambia and Matlab, Bangladesh, our GA estimates from Model 3 (our best-performing model) were within an average of 6 days of ultrasound-based GA. All models produced the most accurate estimates in full term infants (37 to 39 completed weeks GA), however Model 3 provided clinically important improvements (by a week or more in some instances) in accuracy over Model 1. Although Model 1's overall performance appeared satisfactory, its precision was poor in preterm infants, particularly in growth restricted infants (SGA3 and SGA10) across the spectrum of GA. This is not surprising since Model 1 relies only on sex, birth weight and multiple versus singleton birth to estimate GA. In growth restricted infants especially, birth weight is a misleading measure of GA. Model 3 demonstrated improved accuracy compared to Model 1 in SGA infants across the Ontario, Zambia and Bangladesh cohorts, in most cases with minimal to no overlap in 95% confidence intervals. In Ontario infants, point estimates for Model 3 MAE in SGA10/SGA3 infants were 1.13/1.48 respectively, compared to 2.70/3.85 for Model 1. In Zambia, MAE for SGA10/SGA3 infants was 0.87/1.29 compared to 1.79/3.54 for Model 1. In Bangladesh the MAE for SGA10/SGA3 infants was 0.81/0.97 for Model 3 compared to 1.14/ 1.73 for Model 1.
We validated the performance of our GA models, which were derived using heel-prick samples, in umbilical cord blood from the Zambia and Bangladesh cohorts. Estimation of GA using cord blood would provide several advantages. It would reduce the sample collection burden on staff, results in comparably less stress for the newborn and parents and requires less training compared to heel prick sample collection.
Unfortunately, when we applied our GA models to cord blood samples we observed sharply diminished accuracy of predictions in both the Zambia and Bangladesh cohorts. There were two main reasons for this. First, our models were derived using heel prick samples. We were unable to derive cord blood-specific models as cord blood is not routinely collected in our Ontario population. Second, our investigations in paired heel and cord samples showed that only a small subset of cord blood analyte levels are closely correlated with heel prick analytes levels in the same infants, so decreased accuracy is not surprising [26]. Hence, it may be challenging or impossible to develop a cord-blood specific model with acceptable accuracy. Nonetheless, Models 2 and 3 still demonstrated some clinical utility in estimating GA using cord blood samples.
As a sensitivity analysis, we identified analytes with high and low agreement based on Spearman correlation, and derived restricted versions of Model 3 including only analytes meeting a minimum concordance threshold. These restricted models increased the precision of cord blood GA estimates in preterm infants, but not overall (see Supplemental Results in S1 File).
In our Results, we presented scatter plots of observed versus estimated GA (Figs 3 and 4) that showed a systematic tendency for GA to be overestimated in increasingly preterm infants. This was observed for all three models in heel prick and cord blood samples from Zambia and Bangladesh. The same phenomenon was not observed in the Ontario internal validation results. There are a few possible explanations for this apparent systemic bias which can be framed as a model calibration problem. For example, our reference models employed shrinkage penalization during model fitting (via ELASTICNET regression) to control over-fitting, which may have led to over-penalized final regression model coefficients, leading to the opposite problem of under-fitting. However, this was not observed in our internal validation in Ontario data-only in the external validation cohorts. Another potential contributor is the standardization process during data preparation. The same standardization was employed in all three cohorts, but local means and standard deviations were used specific to each cohort. Local standardization was a critical step, and all models had much poorer performance when this step was omitted, however large differences in the dispersion of model predictors across populations, driven by local factors (such as socioeconomic conditions, climate, and underlying differences in birth weight and GA distributions) were likely much larger contributors to differences in covariate distributions. This could have contributed to clinically important variation being muted in the external cohorts. An important implication of this model calibration issue is that preterm (<37 weeks GA) birth estimates based on our models are too low. This is in large part due to "edge effects" which are an important limitation of dichotomizing a naturally continuous variable (for example a GA estimate of 36.9 weeks would be classified preterm while one of 37.0 would be classified as term, despite the estimates being less than a day apart). Model calibration in new external settings is an important consideration and a focus of ongoing investigation by our research team. Direct comparisons of model performance across populations and between our current ML GA estimation models and previously reported conventional regression models, is challenging due to different distributions of GAs and birth weights, and other infant-and setting-specific factors. However, comparing overall internal validation results among previously published models and those presented in this manuscript, we noted improvements in the accuracy of GA estimates overall (RMSE = 1.06 and 1.04 vs. 0.89 for current Model 3), and in preterm infants (RMSE = 1.78 and 1.35 vs. 1.16 for current Model 3) [13,14]. These findings were largely consistent across the internal and external validation cohorts. Therefore, our interpretation is that benefits of our ML approach were clear but represented incremental but clinically relevant improvements, suggesting that our previous models were robustly developed.
Our study had several important strengths. These include our strategy of both internally validating our models in Ontario test data, as well as our engagement with international collaborators to externally validate in mother-infant cohorts from low-and middle-income countries. Samples were collected from infants with GA confirmed by early pregnancy ultrasound, and analyzed centrally (in the same lab where samples were analyzed for the Ontario cohort). The Zambian Preterm Birth Prevention Study (ZAPPS) and Preterm and Stillbirth Study, Matlab (PreSSMat) cohorts, in which our study was conducted, ensured that enrollment was open to representative populations of women and newborns in both Lusaka, Zambia and Matlab, Bangladesh. Other strengths include the high quality of samples received and our collection of paired heel and cord blood samples which allowed the comparison of model performance metrics between sample types, as well analyte level comparisons in paired heel prick and cord blood samples from a large subgroup of infants. The superior accuracy of Model 3 (our best performing metabolomic model) in estimating GA in SGA and preterm infants compared to Model 1 (that only relies on clinical variables) is an important strength, given the limitations reported for commonly-used tools in the postnatal period (i.e., Ballard or Dubowitz scores) in these vulnerable infants [6,7,27]. The primary limitation of this study is the limited number of preterm infants available in our external validation cohorts, especially very and extremely preterm infants (28-32 and <28 weeks gestation respectively). There was a higher likelihood that consent would be provided for the collection of cord blood than for heel prick samples in increasingly preterm infants. This was an important finding as it highlights the advantages of further developing estimation models that work well in cord blood samples. In the Bangladesh cohort, parents expressed reluctance to subject their premature newborns to blood sample collection (via heel prick). Although this was not systematically surveyed at the Zambia site, our study nurses reported a similar hesitancy among Zambian parents. Consequently, the relatively small number of heel prick samples collected from very preterm infants limited our ability to interpret model performance in these subgroups. Our nurses also reported a stigma surrounding the heel prick blood sample collection being perceived as associated with early infant diagnosis of HIV. Although the study team focused educational efforts on dispelling this perception, it had a persistent effect on our ability to recruit the targeted number of participants.
Implementation of our postnatal GA estimation models in LMIC settings to determine population level GA distribution was the end goal of our research, but would present many challenges. The current process involving of local sample collection and international analysis has been successful in our proof of concept, external validation and feasibility work. However, implementing the analysis pipeline locally would be ideal. Many LMIC are lacking in pathology and laboratory medicine services. Barriers include insufficient infrastructure, properly trained personnel and insufficient quality, standards and accreditation education and training for laboratory and medicine personnel [28]. Many newborn screens use tandem mass spectrometry, which has limited availability in many LMIC settings.

Conclusion
Accurate ascertainment of preterm birth rates across LMIC is imperative in order to evaluate the impact of policies and programs aimed at improving pregnancy and neonatal outcomes. In North America, statistical models using data from biochemical analysis of newborn dried blood spots, including those previously developed by our team, have been shown to provide accurate estimates of GA, with some limitations in preterm and growth restricted infants [14][15][16][17][18][19]. In this study we have presented internal and external validation results of our most current ML algorithms employing ELASTIC NET regression for GA estimation in both high and low income settings, providing incremental improvements in performance compared to previously developed models. Large-scale implementation of this approach, and population-level collection and analysis of newborn samples, offers a new opportunity to provide surveillance of the burden of preterm birth in jurisdictions where data are currently lacking.
Supporting information S1 File. Supplemental study methods and results. (DOCX)